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' Abstract 
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An explicit algoritlim for tiie minimization of an li penalized least squares func- 
tional, with non-separable ii term, is proposed. Each step in the iterative algo- 
rithm requires four matrix vector multiplications and a single simple projection on 
1^ . a convex set (or equivalently thresholding). Convergence is proven and a con- 

a I vergence rate is derived for the functional. In the special case where the matrix in 

the ii term is the identity (or orthogonal), the algorithm reduces to the traditional 
iterative soft-thresholding algorithm. In the special case where the matrix in the 
i quadratic term is the identity (or orthogonal), the algorithm reduces to a gradient 

' projection algorithm for the dual problem. 

■ By replacing the projection with a simple proximity operator, other convex non- 

I separable penalties than those based on an ^i-norm can be handled as well. 
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1 Introduction 



^ . Non-smooth minimization problems involving a sum of a quadratic data misfit term and a 
. non-smooth penalty term have received a lot of attention in inverse problems and imaging 
in recent years. In this note we are interested in finding the minimizer x of the ii penalized 
least squares functional J-": 
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X = argminj^(a;) with J^(x) = — y|p + A|| 



by means of an iterative algorithm. Here = X^j""? with Wj e M and ||w||i = J2i 
{wi may be an element of R, R^, . . . and \wi\ stands for the Euclidean length of wf, other 
choices of \wi\ are discussed in section [H]). K is a matrix mixing the variables in the 
quadratic data misfit term and A is a linear operator mixing the variables in the penalty 
term. The quadratic term is convex and smooth, but the penalty term is convex 

and non-smooth. We work in a finite dimensional setting. 
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For the case where the non-smooth penalty term in ([T]) is simple {A = 1) many 
algorithms have appeared in recent years. One of the earliest (not necessarily the most 
efficient) is the iterative soft-thresholding algorithm [T] (see also section [3]). As £i-norm 
penalties promote sparsity, such algorithms are used in 'compressed sensing' [2] for finding 
a sparse solution (up to noise level) of a large-scale under-determined linear system. As 
problems in 2D and 3D imaging are large scale problems, with many unknowns, such 
simple first-order iterative algorithms can still be useful. 

The principal difference of this paper with respect to [1] is the presence of the matrix 
A in the penalty term. In image processing the total variation penalty, which favors 
piece-wise constant images, is popular for its ability to maintain sharp edges. The total 
variation penalty is defined by the £i-norm of the gradient of the unknown {A = grad). 
It has mostly been studied for denoising {K = 1) or for other special operators K (e.g. 
deconvolution). 

Our aim here is to provide a simple iterative algorithm for the problem ([T]) with proven 
convergence (see theorem [1]). We also desire an algorithm that is fully explicit: each step 
in the proposed iteration only uses four matrix- vector multiplications (one by K, K^, A 
and A'^) and a simple projection on the £oo ball (or equivalently a single thresholding). 

Although our main aim is to solve problem ([T]), we will formulate an algorithm and a 
convergence theorem for the more general problem: 

x = argmin J-'(x) with J-'{x) = -\\Kx — yW"^ + H{Ax), (2) 
X 2 

where if is a convex function (we assume that the solution to (|2]) exists). For problem ([2]) 
the projection operator mentioned before is replaced with the proximity operator prox^^, 
of the convex conjugate H* and soft-thresholding is replaced with the proximity operator 
prox^ of H. It is not necessary to know the proximity operator of H{A-). 

A second goal of the paper is to bridge the gap between the well-known iterative soft- 
thresholding algorithm (used for the special case A = 1) and the general case A ^ 1 in 
problem ([T]). The iterative soft-thresholding algorithm is well understood and has a. 1/N 
convergence rate for the decrease of the functional. It is also the basis of an accelerated 
algorithm with an improved l/N"^ rate of decrease of the functional [31 H]. The averages 
of the first iterates of the proposed generalized soft-thresholding algorithm are proven 
to have a 1/N rate on the functional. 

Our results differ from several existing algorithms for solving ([T]) where each itera- 
tion step requires either the solution of another (non-trivial) minimization problem, the 
solution of a linear system, or a non-trivial projection on a convex set. Our proposed 
algorithm may therefore be of use in cases where the matrices involved {K and A) have 
no special structure that makes such sub-problems easily solvable (i.e. not limited to 
deconvolution problems on regular grids, to orthogonal matrices, etc.). 

Iterative algorithms for the denoising case [K = 1) can, amongst others, be found in 
[5l [6]. For general K, an algorithm that uses a smoothing parameter is found in [7], an 
algorithm which needs a projection on a non-trivial convex set is in [8] and an algorithm 
which needs the solution of a non-trivial sub-problem is in [9l [TOl E] . These are results 
for A = grad but this is not essential in those algorithms. 

Zhu and Chan [T2] studied a primal-dual formulation and a so-called 'primal-dual 
hybrid gradient descent' (PDHG) algorithm but concentrated on deconvolution. Con- 
nections with (more general) algorithms for variational inequalities were mentioned. This 
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PDHG algorithm was placed in a general framework for primal-dual algorithms in fT3] and 
many interconnections can be found there. The plethora of algorithms mentioned there 
still require either the solution of a linear system (which may easy in some special cases) 
or the minimization of a non-trivial sub-problem. Applications to image recovery of an 
algorithm that is an instance of the so-called alternating direction method of multipliers, 
are tested in [M]. 

Recently an explicit algorithm was proposed in [151 equation 5.11] with proven con- 
vergence. No rate on the functional was given. That explicit algorithm is different from 
the one presented here. It does not reduce to the iterative soft-thresholding algorithm 
when A = 1. Another explicit algorithm can also be derived using fi6\ Eq. 74] by the 
introduction of additional dual variables. 

It remains a subject of study what speed increase can be gained (if any) from using 
an algorithm that solves a linear system at every iteration. The derivation of an (9(1/A^^) 
algorithm, if at all possible for this problem, would be more interesting. Our analysis and 
proof is inspired by fl7\ [T8] (who discuss a primal-dual algorithm for another problem) and 
by [IS]. It is worth pointing out that no smoothing parameter is introduced in the non- 
smooth part of the functional. The proposed algorithm is not an iteratively reweighted 
least squares algorithm. 



2 Mathematical tools 

We assume that the function H{x) in ([2]) and its convex conjugate H*{w) = sup^{w,x) — 
H{x) are two proper, lower semi-continuous, convex functions on a finite dimensional real 
vector space and with image in M U {+C)o} [T9]. For example in case of problem ([T]), 
H{u) = A||u||i and therefore 

H*M = l^ Mh-^ (3) 

^ ^ [ +0O ||w||oo > A ^ ^ 

such that A||n||i = H{u) = max^(tu, u) — H*{w) = sup||^||^<;^(w, u). 

The proximity operators [20] of the convex functions H and H* are defined as: 

(4) 



\w — u\ 



Therefore, when H{u) = A||m||i and H* is given by expression we find that prox^* (u) 
P\{u), the projection on the £oo ball of radius A. It has a simple explicit expression: 





u 




\u\ 


\u\ 


u 




\u\ 



Px{u) = l \u\ (5) 



(applied component-wise). On the other hand the proximity operator of H{u) = A||m||i is 
the so-called soft-thresholding operator, proXf^^u) = Sx{u). It has the explicit expression: 
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(also applied component-wise). Clearly, soft-thresholding Sx and projection Pa are con- 
nected by: 

Pxiu) = u-Sxiu). (7) 

In the formulas for Px{u) and Sx{u) u can be an element of M, M^, . . . depending on context 
(in particular {Ax)i G when A = grad of a 2D image). We shall use the same notation 
^A, -Pa when applied componentwise to a list of elements of M, M^, . . . 
Proximity operators are Lipschitz-continuous mappings |20j : 

\\pTOXjj,{u) — pTOXjj,{v)\\ < \\u — v\\ Wu,v. (8) 

The subdifferential dH{x) = {'y\H{y) > H{x) + {■j,y — x) Vy} of if in x can be 
characterized using the proximity operator of H. Indeed, from the definition (jl]) it follows 
that = pioXfj{u~) if and only if G dH{u^) + —u~ or —u^ G dH{u^). In other 
words, setting u = u~ — u~^, we have that u G dH{u^) if and only if = pT0Xfj{u^ + u). 

Finally, it can also be shown that the proximity operator of H and its dual H* are 
related by the following identity [2T] : 

pTOXjj*{u) + pTOXjj{u) = u, (9) 

as already verified in equation ([7j) for the special case prox^:^ = S'a and proxj^^, = Pa- 

We refer to [20] for a table of further examples and properties of proximity operators. 
The proximity operators proxj:^. = Pa and prox^:^ = S'a, that will be used for problem ([1]), 
i.e. for H{u) = A||m||i, have explicit expressions that are easy to implement. 

3 Variational equations and special cases 

The variational equations of the minimization problem ([2]) are: 

K^{Kx - y) + A^w = 0, 

where w is an element of the subdifferential of H{Ax). As mentioned before, this means 
that Ax = pTOXfj{w + Ax) or equivalently, using ([9]), that w = pY0Xfj*{w + Ax). The 
variational equations corresponding to the problem ([2]) are therefore: 

K'^{y — Kx) — A^w = and w = prox^^, (w + Ax) . (10) 

The goal of this paper is to write an iterative algorithm that converges to a solution of 
these equations. We assume that these equations have at least one solution {x,w). 

By using that H{Ax) = sup^{Ax, w) — H*{w), the minimization problem ([2]) can also 
be written as a saddle-point problem 

minmaxP(x, w), (11) 

X w 

where we have set: 

F{x,w) = ^\\Kx - yf + {Ax,w) - H*{w). (12) 

A saddle point (x, w) of f ITT]) is a point such that 

F{x, w) < F{x, w) < F{x, w) (13) 
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for all X and w. For completeness, we show in the next section that solutions (x, w) of 
equations ffTUl) are saddle-points of ffTTl) . We define the gap with respect to the saddle- 
point (x, w) by: 

G(x, w) = w) - w). (14) 

It follows from ( 1T3|) that this gap is non- negative for all x and tu. 
In the special case A = 1 the problem ([2]) reduces to: 

min — y|p + if(a;), (15) 
x 2 

for which a forward-backward splitting algorithm 

= prox^ (x" + K'^{y - iTx")) (16) 

can be used. This algorithm converges for \\K\\ < 72 [22]. More specifically, the mmi- 
mization problem with A = 1 and H{x) = A||x||i: 

min^lli^'x - + A||a;||i (17) 
X 2 

can be solved by the iterative soft-thresholding algorithm [1]: 

x^+i = Sx (x" + K^iy - Kx"")) . (18) 

Many other algorithms exist. One feature of this algorithm is that, as a consequence of 
the soft-thresholding, all the iterates x" (not just the limit) have many exact zeros. 
On the other hand, the problem 

min^llx - 5(f + Apxlli (19) 
X 2 

[K = 1, y g in problem ([T])) can be solved by a gradient projection algorithm: 

= + A{g - A^w")) (20) 

where x"' = g — A'^w'^, if ||y4|| < 1 (as is shown in [6], eqn. 11] for A = grad). This is a 
special case of the gradient projection algorithm that can be used for minimization of a 
quadratic function over a convex set C: min^g^; ll^? ~ The quantities Ax"' are not 

sparse in every step, only in the limit will Ax"" be sparse. 



4 Algorithm 

Writing the variational equations (fTOj) as fixed-point equations: 

X = X + K'^{y — Kx) — A^w 



/ 4 N (21) 

w = pioxjj* [w + Ax) , 

provides the usual ansatz for deriving iterative first order algorithms for ([2]). Here we 
choose to study the iteration 

= x" + K^{y - Kx") - A^w" 
^n+i ^ prox^. (w" + Ax"+^) (22) 



the fixed-point of wliicli is a solution to tlie variational equations (11 01) . Specifically, starting 
from {x"',w"') one does a gradient descent step on F{x,w) in the x-variable to arrive at 
{x"'~^^,w^), followed by a proximal ascent step in the w variable to compute w^~^^. Finally 
one does a gradient descent step in {x"', w^~^^) to arrive at (x""*"^, w""*"^). This algorithm can 
therefore be interpreted as a 'predict-correct' algorithm for the saddle-point problem ffTTl) . 
On the other hand the algorithm fl22l) can equivalently be written in a 'pseudo- implicit' 
form as: 

= - - «;"+^) 

^n+i ^ prox^, {w"" + (23) 

This form is useful for proving convergence. 
Writing the algorithm (1221) as: 

+ K^iy - Kx"") 
prox^. (u7" + A(c/"+i - A^tti")) (24) 

leads to the interpretation of a gradient descent step on the quadratic part of the func- 
tional, followed by a single step in a dual variable (compare with fl20|) ) starting from the 
previous dual variable w". 

In the next section we show that the proposed algorithm fl22|) converges to a solution 
of the fixed-point equations fl2Tl) . i.e. to a saddle-point of the min-max problem flTT]) and 
to a minimizer of the functional ([2]). Under some additional condition on H we also derive 
a convergence rate estimate for the functional J-" in the average of the iterates. 

For the special case when AA^ = A^A = 1, the second line of algorithm (122|) reduces 

to: 

= proxj^, (A(x" + K^{y - Kx"))) 

which implies: 

= x" + i^''^(y - iTx") - A'^prox^* (y4(x'^ + K^{y - Kx""))) . 

Using prox^(n) = u — pTox^,{u), one has: 

ar'^+i = A'^prox^((A(x" + ir^(?/ - iTx"))) . 

This is the forward-backward splitting algorithm f|T6|) for the variable Ax and the operator 
KA^. In particular, for H = A|| • ||i, the algorithm fl22l) reduces to the iterative soft- 
thresholding algorithm f|T8l) when AA-'" = = 1. Similarly, when K is orthogonal and 
-^(■) = ^11 ■ then the algorithm fl22]) reduces to 

r u^^+i = P^ti;" + A(ir^?/ - A'^w'^)) 
which is the gradient projection algorithm (I20p for the data g = K^y. 

5 Convergence 

We will prove convergence of algorithm fl22|) . 



W 



n+1 



X 



n+1 
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Lemma 1 If = pTOXfj*{w + A) then 

\\w-w+f < \\w-w-f - \\w- -w+f -2{w-w+,A)+2H*{w)-2H*{w+) (25) 
for all w. 

Proof: If = pTOXfj*{w~ +A) = argmin^ H*{w) + ^\\w — {w~ + A)\\'^ then w~ + A—w~^ G 
dH*{w^). We then have for all w that H*{w) > H*{w^) + (7,^ - w+) if 7 G dH*{w+) 
and therefore: 

H*{w) > H*{w~^) + {w- + A-w+,w -w^) 

= H*{w+) + {A,w - w^) + {w - w^,w - w^) 

= H*{w^) + {A,w - + l\\w' - if+p + l\\w - w^W^ - l\\w' - w 
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which gives □ 
Lemma 2 If = + A, then 

\\x — x^W'^ = \\x — x~ W'^ — \\x~ — x"*"!!^ — 2(x — x"*", A) (26) 

for all X. 

For completeness we show that a solution of the variational equations is a saddle-point of 
ffTTj) . This implies that the gap G{x,w) with respect to the fixed-point {x,w) is always 
non-negative. 

Lemma 3 // {x, w) satisfies the fixed-point equations i\21\} . then 

F{x, w) < F{x, w) < F{x, w) (27) 

and hence 

G{x,w) = F{x,w) - F{x,w) >Q (28) 

for all X and w. 

Proof: The first inequality F{x, w) < F{x, w) comes down to showing that < {Ax, w — 
w) + H*{w) — H*{w) for all w. This follows immediately from choosing w+ = w~ = w 
and A = Ax in lemma [TJ 

The second inequality F{x,w) < F{x,w) can be written as: 

< ^ ~ o 11-^^ ~ vW"^ + {A{x — x) , w) Wx. 

To show this we choose x'^ = x~ = x and A = K^{y — Kx) — A^w in lemma [2] to find: 

= -2{x - x,K'^{y - Kx) ~ A^w) 

= -2{K{x - x),y - Kx) +2{x - x,A^w) 

= —\\K{x — x)p — \\y — KxW^ + \\Kx — -|- 2{x — x, A'^w) 

for all X, or 

\\K{x-x)f = \\Kx-yf - \\Kx - yf + 2{A{x - x),w), 
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which is a shghtly stronger result than needed. □ 
The gap G{x, w) equals: 

G{x,w) = ^\\K{x-x)f + {w-w,Ax)+H*{w)-H*{w) (29) 

as can be verified from its definition (and lemma [2]). The sum of the last three terms on 
the right hand side is non-negative, so 

G{x,w)>^\\K{x-x)f. (30) 

The gap G{x,w) is not a measure of closeness of {x,w) to a saddle-point {x,w) as 
G{x,w) = does not imply that {x,w) is a saddle point. 

Lemma 4 If{x"',w'^) are given by iteration ^2^) then 

_ ^n+l||2 _j_ _ y^n+l||2 ^ || 3. _ 3;" || 2 _|_ || ^ _ || 2 

_||^n _ a,n+l||2 _ ll^n _ ^n+l||2 

-\\K{X - X")||2 + ||ir(x" - X"+1)||2 

-P^(w;-u;")f 

-F||A^(w;" - w;"+i)f + \\A^{w - w^+^)\\'' 
-2 (F(x"+\ w) - F{x, w"+i)) 

for all X and w. 

Proof: From lemmas [1] and [2] we find: 

||^_^n+l||2 < ||^_^n||2_ ||^n_^n+l||2_2^^_^n+l^^^n+l^ 

+2H*{w) - 2H*{w''+^) 
= ||x - llx" IP - 2(a; -a;"+\/s:^(l/ - iTa;") - A^w'^+i) 

which together yield: 

11^; _ 3;".+ ! ||2 _j_ _ ^n+l||2 ^ ll^; _ 2;"||2 _ 11^;"- _ 2;"+! ||2 

-2{w - w''+^ , Ax''+^) 

-2{x - K^{y - Kx"") - 

+2H*{w) - 2H*{w'^+^) 

As fl22l) implies x""'""'^ = x""'""'^ — y4-^(i(;" — w"^-'^), this can be written as: 

_ ^n+l||2 _|_ _ ^n+l||2 ^ || 3, _ 3,". || 2 _ || 2;". _ || 2 

+ \\w — — — 

-2{w - A(x"+i - A^iw" - W'^+^))) . 

-2(x - ir^(?/ - iTx") - 
+2i7*(w) - 2iy*(u;"+i) 



The two {w"^^^ , Ax"^^^) terms cancel: 

||x — x"^"'" IP + ||u' — < ||x — x^lp — llx"" — X"'"'""^|P 

+ ||w-w;"f - ||w"- 
+2{A^{w - w;"+i), A^(w" - w"+i)) 
-2(ir(x-x"+i),?/-irx") 
-2(Ax"+\ w) + 2(x, 
+2i7*(w) - 2if*(?i;"+i). 
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Now, by using the equalities: 

-2{K{x - x'^+^),y - Kx'^) = \\Kx - y\\^ - \\Kx''+^ - y\\^ - \\K{x - x^^f 

-2(Ax"+Sw) + 2(x,yl^w7"+^) = 2F(a;,w"+i) -2F(x"+\w;) - llfsTx-yf 

+ \\Kx^+^ - yf + 2H*{w'^+^) - 2H*{w), 

the previous inequahty reduces to: 

+\\w — w"!!^ — \\w^ — 

+ \\A^{w-w^+^)\\^ 
-\\K{x - a;")f + \\K{x'' - 
+2F(x, - 2F(x"+\ u;), 

which is the desired result. □ 

Theorem 1 Let \\K\\ < \/2 and \\A\\ < 1. If the equations ^^) have a solution and the 
sequence {x^,w^) is defined by the iteration 

^n+i = + x^{y - Kx'^) - A^w"" 

^n+i ^ prox^, {w'' + (31) 
^n+i = + K^{y - Kx'^) - A^w'^+^, 

then: 

1. the sequence {x"',w"') converges to a solution {x\w'') of the variational equations 
( fTZ^) thereby providing a minimizer of ^ and a saddle point of 07]) . 

2. the average of the first N iterates {x^,w'^) = J2iLi{^\ '^^) / ^ > converges to the 
saddle-point (x^, w^) and there exists a constant Ci > independent of N such that: 

») - < + ,32) 

for all X, w, (with Ci = if \\K\\ < 1), in particular: 

< Gd", i.") < lk^-^°f+M-^'T + C. . (33) 

3. If the dual variable w is bounded (H*{w) = +oo for \\w\\ > R for some R > 0), 
there exists a constant C2 independent of N such that: 

< J^(x^) - J^{x^) < C2/N \/N. (34) 
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Proof: i) Let {x,w) be a saddle point of f fTTj) . From lemma H] we find: 

_ 2;"+l||2 _|_ _ y^n+l||2 ^ _ 2;"||2 _ ||7^(^-£. _ 3;") ||2 

+ - - \\A^{w - 

+ - - \\K{x - a;"+i)|p 

where we have used relation ([30]) to set 2F(£, - 2F(x"+\ = -2G(a;''+\ < 

— \\K{x — x"'~^^)\\'^. Using the inequality 



-^\\K{x- 



we find: 

llx — x"'^^ P + ||w — IP < ||x — x^lp + ||w — — II A-^(w) — w") P 

As we assume that Hi^'H < a/2 and ||A|| < 1 we can introduce regular square matrices 
L and B by L^L = 1 - ^K^K and 5^5 = 1 - AA^ and deduce: 

\\x - x^'+^f + \\B{w - w''+^)f < ||x-x"f + ||5(w- 

-||L(a;" - a;"+i)f - ||5(?x;" - w"'+^)f. 

Summing from N to M > N, one also finds: 

\\x-x^'+'f+\\B{w-w^'+')f < Wx-x^'f + \\B{w-~w'')f 

- En=N - a;"+^)||' + II^K - w^"+')||'- 

(35) 

As B is invertible, it follows that the sequence is bounded. Hence there is a 

convergent subsequence {x""^ ,w"'^) "'^^ (x^w^) (the same subsequence for x" and w"). It 
also follows from inequality 0351) that: 

M 

||^(a;"-x"+^)f + ||5(w'^-w;"+^)f < ||x - x^f + ||5(^^; - . (36) 

n=N 

Hence ||L(x" — x"^^)|p and ||i?(w7"' — w"'''^)|p tend to zero for large n, which implies that 
||^n_^n+i|| ||i(;" — w"+^|| tend to zero. It follows that the subsequence {x^^'^'^ ,w"'^~^'^) 
also converges to {x\w'') and, by continuity of proxj^^*, that {x\w'^) satisfies the fixed- 
point equations fl2T|) . We can therefore choose {x,w) = {x^w"^) in relation fl35|) to find: 



X 



t _ a;M+i||2 ^ p(^t _ < ||xt - x^f + \\B{w^ - U7^)f (37) 



for all M > N. As there is a convergent subsequence of {x"',w"), the right hand side of 
this expression can be made arbitrarily small for large enough N {N = nj for some j). 
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Hence the left hand side will be arbitrarily small for all M larger than this A^. This proves 
convergence of the whole sequence to {x^w"^). 

ii) As {x^,w"') the Cesaro averages {x^,w'^) = J2n=i(^"' ^ ''^^) / 

converge to It follows from lemma H] that: 

Then, using convexity, one finds: 

N-l 



n=0 

123 1 ^ 



n=0 

Af-1 

n\ II 2 

y II :/: — :/: ii h- ii i '/;/ — /// 

2N 

-\\x - x''+'\\^ - \\B{w - w''+^)f 



n=0 

< ^(^\^-A\'+\\B{w-w')r 

N 
n=0 

If < 1 the summation on the right hand side can be dropped outright. If 1 < ||/^|| < 
a/2, the II x*^ — x'^^^lp terms can be dropped, and the sum Yln=o \\K{x"' — can be 

bounded by the series Yl'^=o ~ x*^"^^)!^. The latter converges as a consequence of 

relation fl5Bl) and the regularity of the matrix L. Therefore, a constant Ci independent of 
A^ (and equal to in case ||fC|| < 1), can be introduced such that 

Fii", uO - F(x, u,") < lk-^T + B^- ^°P + C. 

(where we have used ||i?|| < 1). 

Relation ( 133|1 follows from choosing {x,w) = (x"'', ty"'') in equation f l32|) . 
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iii) In case H* is such that the dual variable w is bounded [H* = +00 for all w with 
II It'll > R, for some i? > 0), we find 

< J^(£^) - J^(xt) = J^{x^) - F{x\ w^) 

lemma 13] , , , . , 

supF(x^,«;) 
= max w) — w^) 

||to||<_R 

<E2j lUt _ a;0||2 _^ _ y^0||2 _j_ 

< max 

||to|j<_R 2N 

C2/N 

which proves relation fl5^ . □ 



6 Discussion 

In case the conditions Hi^H < \/2 and/or ||A|| < 1 are not satisfied, it is possible to rescale 
the matrices, the data y and the variable w to write a convergent algorithm. In the special 
case of functional ([1]) it suffices to rewrite the problem equivalently as: 

m:m\\\{V^K)x-{V^y)r + ^\\{V^A)xh 

and use algorithm (l22l) with prox^^^. = Prxj^-, for the matrices \fTK^ \foA and the data 
a/t?/. Renaming w *r- A/aw/r and using the scaling property P^;),/^(m) = r / y/a P\{y/au / t) 
one finds the following iteration: 

(39) 

for problem ([T]). Now step size parameters a, r > should satisfy r < 2/||i^'"^if|| and 
a < l/\\AA^\\. In this case the bound (134|) is valid as the dual variable w is bounded (an 
element of the £oo ball of radius A). 

For the general case ([2]), the scaled version of the algorithm can be derived in a similar 
fashion. It takes the form 

= + TK^{y - Kx'^) - tA^w" 
^n+i ^ prox.^, («;" + a/rAx"+i) (40) 
^n+i ^ jJ^K^iy - isTx") - tA^w'^^^ 

with r < 2/\\K^K\\ and a < l/\\AA^\\. Here we have used that {tH{-/^))* = 
TH*{y/a/T ■) and proxj(aM) = a proxQ,-2 ^(q,.-) (m) for a > 0. 

It can be verified numerically that the functional J^{x'^) does not necessarily decrease 
monotonically as a function of n (this can be shown to hold in the special case A = 1 and 
H{u) = A||u||i, see [3]). The gap function G(x",w") does not decrease monotonically as 
a function of n either. The error between {x"',w"') and (x^,w^) decreases monotonically 
as a function of n in the norm (||xp + ||-Bwp)2. This is a consequence of relation (137|) . 



= x" + TK^{y - fsTx") - 
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The condition ||y4|| < 1 used in the proof of convergence excludes the case A = 1. 
Nevertheless, the proof of convergence in theorem [T] can be slightly adapted to cover the 
case A = 1 as well. 

The strength of algorithm (122|) lies in the fact that only prox^:^, is needed and not 
pioxjjf^^.y pioXfj, may have a simple expression whereas the proximity operator of H{A-) 
may not. In particular, for H = X\\ ■ one has expressions ([5]) and ([6]) for proxj^^* 
and proxj:^. One can also find closed-form expressions for proxj:^, when | ■ | (used in the 
expression = Yli refers to the 1- or oo-norms instead of the 2- norm. 

We believe the proposed algorithm fl39|) . its connection with the traditional iterative 
soft-thresholding algorithm and its proof of convergence are new. The combination of 
a gradient step with the dual algorithm fl20|) has been proposed several times already 
flU\ [TT] : as such that would not be an explicit algorithm as it requires infinitely many 
dual iterations in each outer iteration. Here we have shown convergence in the case 
when just one dual step is made in each iteration. The series of algorithms discussed 
in [131 US] mostly make use of a non-explicit step in the iteration, or of the solution of 
a linear system at every iteration. These existing algorithms are often special cases of 
more general methods. The explicit algorithm in [15] is also different. In [161 Eq. 74] the 
authors propose another explicit method using additional dual variables. 

In Korpelevich introduced an extragradient algorithm for the solution of a 

general saddle point problem. It relies on updating two copies of primal and dual variables 
(say {x^,w^) and (x", tD")), and combining the one with the gradient in the other point. 
No distinction is made there between primal and dual variables, as we do here. 

In [25] a proximal point algorithm is introduced that reduces to the Korpelevich al- 
gorithm in a special case. Moreover it was shown that the iterates of that algorithm 
converge 'ergodically' with rate 1/A^ on the primal objective function (i.e. the Cesaro 
means decrease the functional with rate 1/N). It is however assumed there that both 
primal and dual variables are bounded (saddle point problem on a compact domain). 
This is a major difference with our result, where we only require that the dual variable w 
be bounded in order to derive a 1/iV bound on the functional (point E] of theorem [T]); the 
primal variable x is unbounded. 

We did not try to extend the convergence proof to an infinite dimensional setting, as 
was done in p] for the iterative soft-thresholding algorithm. The most useful example 
of problem ([1]) is perhaps the case where A = grad (total variation penalty), but this 
operator is unbounded in the infinite dimensional case. 

Algorithm fl5^ for problem ([T]) can also be used for signal recovery under analysis style 
sparsity requirements [26]: finding x with many {Ax)i equal to zero for a given frame 
operator A. Another application of algorithm ( 139|) is solving a linear inverse problem 
while imposing group sparsity (possibly with overlapping groups) [27]. In this case the 
matrix A is chosen in such a way that ||Aa;||i = Xlfcegroups l(^j)jegroupj.|, i-e A has a single 
1 on each row (and all other elements are zero). Columns may have more than a single 
nonzero entry (this would correspond to overlapping groups). In this expression | ■ | is 
again the Euclidean norm of a vector. 
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Figure 1: Synthetic seismic tomography experiment using total variation penalty (see 
section [7j): (a) input model with both sharp and smooth edges between zones of constant 
model value; (b) reconstruction from 8490 noisy data with 1000 iterations of algorithm 
(j2S]) and A = grad; (c) evolution of the distance to the limit model and of the functional 
to the limit value (here x is obtained from 100000 iterations of the same algorithm; x is 
not equal to x™); (d) Sum of the rows of the matrix K to indicate the illumination of the 
sphere by the rays in the data set. 
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7 Numerical example of total variation minimization 



Algorithm fl39|) is applied to a stylized problem in seismic tomography. We try to recon- 
struct a simple synthetic 2D input model defined on the sphere (see figure [Ha) from 
8490 data. The input model has a number of zones of constant value with either a sharp 
edge or a smooth edge in between. The model space has dimension 98304. The data y 
are found from 8490 seismic surface rays that criss-cross the globe (these correspond to 
actual earthquakes and seismic stations |2B]) and that make up the rows of a matrix K 
(see figure [Hd). More precisely, synthetic data y are constructed through the formula 
y = Kx^^ + e, where e is gaussian noise of magnitude ||e|| = 0.1 x ||i^a;'°|| (i.e. 10% noise). 
The aim is now to reconstruct x'" as well as possible from y by imposing a total variation 
penalty in cost function ([T]). In other words we will look for the minimizer of function ([T]) 
where K and y are given and where we choose A = grad. 

Algorithm ([39]), with r = 0.99/||/r|p and a = 0.99/ 1| A f and 1000 iterations was used 
to produce a reconstruction: = x^^"*^ (in about 20 seconds of computer time). The 
penalty parameter A was chosen to fit the data to the level of the noise: —y\\ = \\e\\. 

The original model and its reconstruction are shown in figured], panels (a) and (b). 
The total variation penalty results in a piece-wise constant output model. Sharp edges 
(e.g. near North America) are reasonably well resolved given the small amount of data 
available. In panel (c) the distance of iterate x" to a reference minimizer x is shown. The 
residual error (with respect to x) after 1000 iterations is 10%, and the functional attains 
about 3 correct decimal compared to the 'true' minimal value. 
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